{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a7oq3cfnync",
   "metadata": {},
   "source": [
    "# Extracting Repeating Entities from Documents\n",
    "\n",
    "This notebook demonstrates how to use the `PER_TABLE_ROW` extraction target to extract structured data from documents containing repeating entities like tables, lists, or catalogs.\n",
    "\n",
    "## Why Use the Tabular Extraction Target?\n",
    "\n",
    "`PER_DOC` is the default extraction target in LlamaExtract (see the table below for a quick overview of the extraction targets). It considers the entire document's context in a single extraction. For lists of entities, this has a critical failure mode: LLM-based extraction often **captures only the first few dozen entries** of a long list, because LLMs tend to lose track of items in long, repetitive data. Document-level extraction therefore does not guarantee exhaustive coverage, and long lists often come back incomplete.\n",
    "\n",
    "**The Solution**: `PER_TABLE_ROW` solves this by processing each entity individually or in smaller batches, ensuring **exhaustive extraction** of all entries regardless of list length.\n",
    "\n",
    "### Entity-Level Extraction\n",
    "\n",
    "When using `extraction_target=ExtractTarget.PER_TABLE_ROW`, you define a schema for a **single entity** (e.g., one hospital, one product, one invoice line item), not the full document. LlamaExtract automatically:\n",
    "- Detects the formatting patterns that distinguish individual entities (table rows, list items, section headers, etc.)\n",
    "- Applies your schema to each identified entity\n",
    "- Returns a `list[YourSchema]` with one object per entity\n",
    "\n",
    "This approach is ideal when each entity locally contains all the information needed for your schema.\n",
    "\n",
    "### Choosing the Right Extraction Target\n",
    "\n",
    "| Extraction Target | Best For | Returns |\n",
    "|-------------------|----------|---------|\n",
    "| `PER_DOC` | Single-entity documents, summaries, or short lists | One JSON object for entire document |\n",
    "| `PER_PAGE` | Multi-page documents where each page is independent | One JSON object per page |\n",
    "| `PER_TABLE_ROW` | **Long lists, tables, catalogs with repeating entities** | List of JSON objects (one per entity) |\n",
    "\n",
    "📖 For more details, see the [Extraction Target documentation](https://developers.llamaindex.ai/python/cloud/llamaextract/features/concepts/#extraction-target)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9427d1de",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dotenv import load_dotenv\n",
    "from llama_cloud_services import LlamaExtract\n",
    "\n",
    "\n",
    "# Load environment variables (put LLAMA_CLOUD_API_KEY in your .env file)\n",
    "load_dotenv(override=True)\n",
    "\n",
    "# Optionally, add your project id/organization id\n",
    "llama_extract = LlamaExtract()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4426b360",
   "metadata": {},
   "source": [
    "## Table of Hospitals by County and Insurance Plans\n",
    "\n",
    "We have a PDF that lists hospitals by county, along with the Blue Shield of California insurance plans available at each one.\n",
    "\n",
    "![First few entries from the PDF](./data/tables/bsc_page1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c86sjymhn1r",
   "metadata": {},
   "source": [
    "We want to extract each hospital from this table, along with the list of insurance plans that apply to it.\n",
    "\n",
    "### Example 1: Structured Table\n",
    "\n",
    "This is an ideal use case for `PER_TABLE_ROW` extraction:\n",
    "- **Clear structure**: The document has explicit table formatting with rows and columns\n",
    "- **Repeating entities**: Each row represents one hospital with consistent attributes\n",
    "- **Local information**: All data for each hospital (county, name, plans) is contained within its row\n",
    "\n",
    "Notice that our `Hospital` schema describes a **single hospital**, not the full document. LlamaExtract will return a `list[Hospital]` with one entry per table row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c61a802",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pydantic import BaseModel, Field\n",
    "\n",
    "\n",
    "class Hospital(BaseModel):\n",
    "    \"\"\"A single hospital, with its county and the BSC plans available there.\"\"\"\n",
    "\n",
    "    county: str = Field(description=\"County name\")\n",
    "    hospital_name: str = Field(description=\"Name of the hospital\")\n",
    "    plan_names: list[str] = Field(\n",
    "        description=\"Plans available at the hospital. Each entry is one of: Trio HMO, SaveNet, Access+ HMO, BlueHPN PPO, Tandem PPO, PPO\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8a69b7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_cloud_services.extract import ExtractConfig, ExtractMode, ExtractTarget\n",
    "\n",
    "\n",
    "result = await llama_extract.aextract(\n",
    "    data_schema=Hospital,\n",
    "    files=\"./data/tables/BSC-Hospital-List-by-County.pdf\",\n",
    "    config=ExtractConfig(\n",
    "        extraction_mode=ExtractMode.PREMIUM,\n",
    "        extraction_target=ExtractTarget.PER_TABLE_ROW,\n",
    "        parse_model=\"anthropic-sonnet-4.5\",\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43722cda",
   "metadata": {},
   "source": [
    "### Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95b5aca6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "380"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(result.data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e355770",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'county': 'Alameda',\n",
       "  'hospital_name': 'Alameda Hospital',\n",
       "  'plan_names': ['Trio HMO',\n",
       "   'SaveNet',\n",
       "   'Access+ HMO',\n",
       "   'BlueHPN PPO',\n",
       "   'Tandem PPO',\n",
       "   'PPO']},\n",
       " {'county': 'Alameda',\n",
       "  'hospital_name': 'Alta Bates Med Ctr Herrick Campus',\n",
       "  'plan_names': ['Trio HMO',\n",
       "   'Access+ HMO',\n",
       "   'BlueHPN PPO',\n",
       "   'Tandem PPO',\n",
       "   'PPO']},\n",
       " {'county': 'Alameda',\n",
       "  'hospital_name': 'Alta Bates Summit Med Ctr Alta Bates Campus',\n",
       "  'plan_names': ['Trio HMO',\n",
       "   'Access+ HMO',\n",
       "   'BlueHPN PPO',\n",
       "   'Tandem PPO',\n",
       "   'PPO']},\n",
       " {'county': 'Alameda',\n",
       "  'hospital_name': 'Alta Bates Summit Med Ctr Summit Campus',\n",
       "  'plan_names': ['Trio HMO',\n",
       "   'Access+ HMO',\n",
       "   'BlueHPN PPO',\n",
       "   'Tandem PPO',\n",
       "   'PPO']},\n",
       " {'county': 'Alameda',\n",
       "  'hospital_name': 'Alta Bates Summit Medical Center',\n",
       "  'plan_names': ['Trio HMO',\n",
       "   'Access+ HMO',\n",
       "   'BlueHPN PPO',\n",
       "   'Tandem PPO',\n",
       "   'PPO']},\n",
       " {'county': 'Alameda',\n",
       "  'hospital_name': 'BHC Fremont Hospital',\n",
       "  'plan_names': ['Trio HMO',\n",
       "   'SaveNet',\n",
       "   'Access+ HMO',\n",
       "   'BlueHPN PPO',\n",
       "   'Tandem PPO',\n",
       "   'PPO']},\n",
       " {'county': 'Alameda',\n",
       "  'hospital_name': 'Centre For Neuro Skills San Francisco',\n",
       "  'plan_names': ['Trio HMO',\n",
       "   'SaveNet',\n",
       "   'Access+ HMO',\n",
       "   'BlueHPN PPO',\n",
       "   'Tandem PPO',\n",
       "   'PPO']},\n",
       " {'county': 'Alameda',\n",
       "  'hospital_name': 'Eden Medical Center',\n",
       "  'plan_names': ['Trio HMO', 'Access+ HMO', 'PPO']},\n",
       " {'county': 'Alameda',\n",
       "  'hospital_name': 'Fairmont Hospital',\n",
       "  'plan_names': ['Trio HMO',\n",
       "   'SaveNet',\n",
       "   'Access+ HMO',\n",
       "   'BlueHPN PPO',\n",
       "   'Tandem PPO',\n",
       "   'PPO']},\n",
       " {'county': 'Alameda',\n",
       "  'hospital_name': 'Highland Hospital',\n",
       "  'plan_names': ['Trio HMO',\n",
       "   'SaveNet',\n",
       "   'Access+ HMO',\n",
       "   'BlueHPN PPO',\n",
       "   'Tandem PPO',\n",
       "   'PPO']}]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result.data[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e28f0de8",
   "metadata": {},
   "source": [
    "![](./data/tables/bsc_results.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "di156pb7s6j",
   "metadata": {},
   "source": [
    "**Success!** We extracted all **380 hospitals** from the multi-page PDF. Each entity was correctly parsed with its county, hospital name, and applicable insurance plans. With `PER_DOC`, we would likely have only gotten the first 20-30 entries."
   ]
  },
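  {
   "cell_type": "markdown",
   "id": "plan-sanity-check-md",
   "metadata": {},
   "source": [
    "As a quick sanity check on a result like this, one option is to count the extracted rows per county and flag any plan names outside the six listed in the schema description. The cell below is a minimal sketch, not part of LlamaExtract: `sample_rows` is a hypothetical stand-in that copies three rows from the output above so the cell runs on its own, and `ALLOWED_PLANS`, `rows_per_county`, and `unexpected_plans` are illustrative names. To check the real extraction, iterate over `result.data` instead of `sample_rows`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "plan-sanity-check-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "# Plan names listed in the Hospital schema description above.\n",
    "ALLOWED_PLANS = {\n",
    "    \"Trio HMO\",\n",
    "    \"SaveNet\",\n",
    "    \"Access+ HMO\",\n",
    "    \"BlueHPN PPO\",\n",
    "    \"Tandem PPO\",\n",
    "    \"PPO\",\n",
    "}\n",
    "\n",
    "# Hypothetical stand-in for result.data: three rows copied from the output above.\n",
    "sample_rows = [\n",
    "    {\n",
    "        \"county\": \"Alameda\",\n",
    "        \"hospital_name\": \"Alameda Hospital\",\n",
    "        \"plan_names\": [\"Trio HMO\", \"SaveNet\", \"Access+ HMO\", \"BlueHPN PPO\", \"Tandem PPO\", \"PPO\"],\n",
    "    },\n",
    "    {\n",
    "        \"county\": \"Alameda\",\n",
    "        \"hospital_name\": \"Alta Bates Summit Medical Center\",\n",
    "        \"plan_names\": [\"Trio HMO\", \"Access+ HMO\", \"BlueHPN PPO\", \"Tandem PPO\", \"PPO\"],\n",
    "    },\n",
    "    {\n",
    "        \"county\": \"Alameda\",\n",
    "        \"hospital_name\": \"Eden Medical Center\",\n",
    "        \"plan_names\": [\"Trio HMO\", \"Access+ HMO\", \"PPO\"],\n",
    "    },\n",
    "]\n",
    "\n",
    "# Count extracted hospitals per county.\n",
    "rows_per_county = Counter(row[\"county\"] for row in sample_rows)\n",
    "\n",
    "# Collect, per hospital, any plan names that fall outside the allowed set.\n",
    "unexpected_plans = {}\n",
    "for row in sample_rows:\n",
    "    extra = sorted(set(row[\"plan_names\"]) - ALLOWED_PLANS)\n",
    "    if extra:\n",
    "        unexpected_plans[row[\"hospital_name\"]] = extra\n",
    "\n",
    "print(rows_per_county)\n",
    "print(unexpected_plans)"
   ]
  },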
  {
   "cell_type": "markdown",
   "id": "gelvl6db268",
   "metadata": {},
   "source": [
    "## Extracting from a Toy Catalog\n",
    "\n",
    "### Example 2: Semi-Structured List\n",
    "\n",
    "The `PER_TABLE_ROW` extraction target also works well for documents that aren't explicit tables but have similar properties:\n",
    "- **Ordered listing**: The toys are listed sequentially with visual separation (section headers, spacing)\n",
    "- **Repeating pattern**: Each toy entry has a consistent structure (code, name, specs, description)\n",
    "- **Local information**: All attributes for each toy are grouped together in its entry\n",
    "\n",
    "Even though this isn't a traditional table format, each toy entity locally contains all the information needed for our schema. LlamaExtract detects the formatting patterns that distinguish each toy and extracts them as separate entities.\n",
    "\n",
    "![](./data/tables/toy_catalog_page.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cf0b2db",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pydantic import BaseModel, Field\n",
    "\n",
    "\n",
    "class ToyCatalog(BaseModel):\n",
    "    \"\"\"Product information from a toy catalog.\"\"\"\n",
    "\n",
    "    section_name: str = Field(\n",
    "        description=\"The name of the toy section (e.g. Table Toys, Active Toys).\"\n",
    "    )\n",
    "    product_code: str = Field(\n",
    "        description=\"The unique product code for the toy (e.g., GA457).\"\n",
    "    )\n",
    "    toy_name: str = Field(description=\"The name of the toy.\")\n",
    "    age_range: str = Field(\n",
    "        description=\"The recommended age range for the toy (e.g., 6 +, 4 +).\",\n",
    "    )\n",
    "    player_range: str = Field(\n",
    "        description=\"The number of players the toy is designed for (e.g., 2, 2-4, 1-6).\",\n",
    "    )\n",
    "    material: str = Field(\n",
    "        description=\"The primary material(s) the toy is made of (e.g., wood, cardboard).\",\n",
    "    )\n",
    "    description: str = Field(\n",
    "        description=\"A brief description of the toy and its components and dimensions.\",\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "mysu1i2qo9e",
   "metadata": {},
   "source": [
    "Again, our schema represents a **single toy product**, not the entire catalog. LlamaExtract will return a `list[ToyCatalog]` with one entry per toy.\n",
    "\n",
    "### Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b38b806",
   "metadata": {},
   "outputs": [],
   "source": [
    "result = await llama_extract.aextract(\n",
    "    data_schema=ToyCatalog,\n",
    "    files=\"./data/tables/Click-BS-Toys-Catalogue-2024.pdf\",\n",
    "    config=ExtractConfig(\n",
    "        extraction_mode=ExtractMode.PREMIUM,\n",
    "        extraction_target=ExtractTarget.PER_TABLE_ROW,\n",
    "        parse_model=\"anthropic-sonnet-4.5\",\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91aface0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "153"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(result.data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51278736",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'section_name': 'Table Toys',\n",
       "  'product_code': 'GA457',\n",
       "  'toy_name': 'Dots and Boxes',\n",
       "  'age_range': '6+',\n",
       "  'player_range': '2',\n",
       "  'material': 'wood',\n",
       "  'description': 'base 17x17 cm\\n50 border pieces 4x1,2x0,3 cm\\n34 trees 2,6x1,4 cm'},\n",
       " {'section_name': 'Table Toys',\n",
       "  'product_code': 'GA456',\n",
       "  'toy_name': '3 In a Row',\n",
       "  'age_range': '8+',\n",
       "  'player_range': '2',\n",
       "  'material': 'wood, pine, cardboard',\n",
       "  'description': 'base 24x22,5x2,5 cm\\n30 cards 5,5x5 cm\\n6 chips'},\n",
       " {'section_name': 'Table Toys',\n",
       "  'product_code': 'GA467',\n",
       "  'toy_name': 'Which Cow am i?',\n",
       "  'age_range': '6+',\n",
       "  'player_range': '2',\n",
       "  'material': 'wood, beech',\n",
       "  'description': '2 cow bases 56x4x4,5 cm\\n16 cards 4x5 cm'},\n",
       " {'section_name': 'Table Toys',\n",
       "  'product_code': 'GA460',\n",
       "  'toy_name': 'Balance Bunnies',\n",
       "  'age_range': '4+',\n",
       "  'player_range': '2',\n",
       "  'material': 'wood',\n",
       "  'description': '1 base 35x12x25 cm\\n7 bunnies 7 foxes\\n1 dice 3 cm'},\n",
       " {'section_name': 'Table Toys',\n",
       "  'product_code': 'GA462',\n",
       "  'toy_name': 'Color Combination Race',\n",
       "  'age_range': '4+',\n",
       "  'player_range': '2-4',\n",
       "  'material': 'wood, cardboard',\n",
       "  'description': 'base 6,5x6,5x15 cm, rings 5,5x5,5x0,5 mm\\ncardholder 6x6x2 cm, cards 5,5x5,5 cm\\ncolor cards Ø 15,5 cm - Ø 7 cm'},\n",
       " {'section_name': 'Table Toys',\n",
       "  'product_code': 'GA465',\n",
       "  'toy_name': 'Plop It',\n",
       "  'age_range': '6+',\n",
       "  'player_range': '2-4',\n",
       "  'material': 'wood, elastic, cardboard',\n",
       "  'description': 'Catch the right balls and plop them in the net!\\n* 2 ploppers 8x5 cm\\n* 2 net holders Ø 5cm, length 55 cm\\n* 6 cards 1,5x2,5 cm, 30 balls Ø 2,5 cm\\n* 1 rope 120 cm'},\n",
       " {'section_name': 'Table Toys',\n",
       "  'product_code': 'GA466',\n",
       "  'toy_name': 'Whack a Shape',\n",
       "  'age_range': '4+',\n",
       "  'player_range': '2-4',\n",
       "  'material': 'wood',\n",
       "  'description': '* base 38,5x15,5 cm\\n* 2 stands 36 half balls, 4 hammers\\n* 1 dice 2,5 cm\\n* 4 cards'},\n",
       " {'section_name': 'Table Toys',\n",
       "  'product_code': 'GA458',\n",
       "  'toy_name': 'Sling Puck | Table Hockey',\n",
       "  'age_range': '6+',\n",
       "  'player_range': '2',\n",
       "  'material': 'wood',\n",
       "  'description': '* double sides base 39x21x3 cm\\n* 10 chips Ø 2,5 cm\\n* 2 pushers 4x4x3 cm'},\n",
       " {'section_name': 'Table Toys',\n",
       "  'product_code': 'GA039',\n",
       "  'toy_name': 'DIY Birdhouse',\n",
       "  'age_range': '3+',\n",
       "  'player_range': '1',\n",
       "  'material': 'wood',\n",
       "  'description': '* house 9x9x13 cm'},\n",
       " {'section_name': 'Table Toys',\n",
       "  'product_code': 'GA319',\n",
       "  'toy_name': 'Triangle Domino',\n",
       "  'age_range': '6+',\n",
       "  'player_range': '2-4',\n",
       "  'material': 'wood',\n",
       "  'description': '* 35 triangles 10x10 x10 cm'}]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result.data[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1810c0a",
   "metadata": {},
   "source": [
    "![](./data/tables/toy_catalog_results.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ezur9gnhmsb",
   "metadata": {},
   "source": [
    "**Success!** Despite the semi-structured format, we extracted all **152 toy products** from the catalog. The result has 153 entries because one toy was extracted a second time from the Appendix section. LlamaExtract automatically detected the visual patterns separating each toy entry and applied our schema to each one."
   ]
  },
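  {
   "cell_type": "markdown",
   "id": "dedupe-toys-md",
   "metadata": {},
   "source": [
    "One way to drop a repeated entry like this is to de-duplicate on `product_code`, keeping the first occurrence. The cell below is a minimal sketch that assumes product codes are unique per toy. `sample_toys` is a hypothetical stand-in (two entries trimmed from the output above, with the first repeated to mimic the duplicate), and `dedupe_by_key` is an illustrative helper, not a LlamaExtract function. To apply it to the real extraction, pass `result.data` instead of `sample_toys`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dedupe-toys-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "def dedupe_by_key(rows, key):\n",
    "    \"\"\"Return rows without repeats of the same `key` value, keeping first occurrences.\"\"\"\n",
    "    seen = set()\n",
    "    unique = []\n",
    "    for row in rows:\n",
    "        if row[key] not in seen:\n",
    "            seen.add(row[key])\n",
    "            unique.append(row)\n",
    "    return unique\n",
    "\n",
    "\n",
    "# Hypothetical stand-in for result.data, trimmed to two fields per entry.\n",
    "sample_toys = [\n",
    "    {\"product_code\": \"GA457\", \"toy_name\": \"Dots and Boxes\"},\n",
    "    {\"product_code\": \"GA456\", \"toy_name\": \"3 In a Row\"},\n",
    "    {\"product_code\": \"GA457\", \"toy_name\": \"Dots and Boxes\"},  # simulated repeat\n",
    "]\n",
    "\n",
    "unique_toys = dedupe_by_key(sample_toys, \"product_code\")\n",
    "print(len(sample_toys), len(unique_toys))"
   ]
  },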
  {
   "cell_type": "markdown",
   "id": "aeyr3io29u",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "The `PER_TABLE_ROW` extraction target is powerful for extracting repeating structured entities from documents. Key takeaways:\n",
    "\n",
    "1. **Schema design**: Define your schema for a single entity, not the full document. The system returns `list[YourSchema]`.\n",
    "\n",
    "2. **Works with various formats**: Not just traditional tables—any document with distinguishable repeating entities (bullets, numbering, headers, visual separation, etc.). The common requirement is that each entity should contain all the necessary data for your schema within its local context.\n",
    "\n",
    "3. **Automatic pattern detection**: LlamaExtract identifies the formatting patterns that distinguish entities and applies your schema to each one."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
